Standoff properties as an alternative to XML for digital historical editions

Author

  • Desmond Schmidt
Abstract

In the past few years digital humanists have recognised that the development of tools for the collaborative creation of shareable, reusable and archivable transcriptions of historical documents is being held back by the complexity and lack of interoperability of customisable XML encoding standards. To fix this problem transcriptions must be split into their constituent parts: a plain text and individual sets of markup. The plain text is easily processable by text-analysis software or search engines, and may be divided into layers representing different states of each document. Each markup set may also represent a separate layer of information to enrich an underlying plain text, and may be freely combined with others. Unlike in XML, textual properties in markup sets can freely overlap. Plain text versions and their corresponding markup can also be reliably and efficiently converted into HTML, the language of the Web. Unlike previous attempts at separating markup from text, the proposed data representation is designed to be fully editable, and development of a suitable Web-based editor is at an advanced stage. To solve the problems we now face will require a radical rethink of how we mark up texts, and the adoption of globally interoperable standards to which these internal representations can be easily converted, such as HTML, plain text and RDFa.
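As a rough illustration of the idea in the abstract (a plain text plus separate, freely overlapping markup sets that can be flattened into HTML), here is a minimal Python sketch. It is not the data format or algorithm of the paper: the `Property` class, its `name`/`start`/`end` fields, the `to_html` function and the sample property names are all hypothetical, chosen only for this example. Overlap is handled by cutting the text at every property boundary and giving each resulting segment the class names of all properties that cover it.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Property:
    """One standoff property: a named range over the plain text (hypothetical model)."""
    name: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


def to_html(text, *markup_sets):
    """Merge any number of markup sets over one plain text into flat HTML.

    The text is cut at every property boundary; each segment is wrapped in a
    <span> whose classes name all properties covering it, so overlapping
    ranges never need to nest.
    """
    props = [p for markup_set in markup_sets for p in markup_set]
    cuts = sorted({0, len(text)}
                  | {p.start for p in props}
                  | {p.end for p in props})
    pieces = []
    for a, b in zip(cuts, cuts[1:]):
        names = sorted({p.name for p in props if p.start <= a and b <= p.end})
        chunk = escape(text[a:b])
        if names:
            classes = " ".join(names)
            pieces.append('<span class="' + classes + '">' + chunk + '</span>')
        else:
            pieces.append(chunk)
    return "".join(pieces)


# Two independent markup sets over the same plain text; their ranges overlap.
text = "The quick brown fox"
emphasis = [Property("emph", 4, 15)]      # "quick brown"
editorial = [Property("deleted", 10, 19)]  # "brown fox"

print(to_html(text, emphasis, editorial))
```

Because the plain text is stored on its own, it can be passed unchanged to a search engine or text-analysis tool, and either markup set can be dropped or swapped without touching the other.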

Similar resources

Digital Editions beyond XML - Graph-based Digital Editions

XML has been the de facto standard for digital editions for years, but its serious limitations include an inability to represent overlapping markup and the encoding of multiple annotation hierarchies. With emerging graph database technologies we have the opportunity to develop new approaches. In this paper the advantages and modelling principles of graph-based digital editions will be discussed.

A framework for processing and presenting parallel text corpora

This thesis describes an extensible framework for the processing and presentation of multi-modal, parallel text corpora. It can be used to load digital documents in many formats, for example pure text, XML or bit-mapped graphics, to structure these documents with a uniform markup and to link them together. The structuring or tagging can be done with respect to formal, linguistic, semantic, his...

Creating an XML Vocabulary for Encoding Lute Music

We describe the development of an XML representation, called TabXML, for encoding historical sources of lute music. These sources employ a special notation type, tablature, that is very hard to understand for non-lutenists. This paper discusses several issues in creating TabXML: 1. what to represent: the notational meaning or the text of the tablature, and how to represent it; 2. an analysis of...

Representing and Querying Standoff XML

The paper discusses the representation and exploitation of multi-level annotated linguistic data. We first present a standoff XML representation, which distributes information over separate, standoff layers and allows us to represent annotations of various kinds in a uniform, generic way. This format serves as our interchange format. We further introduce an XML-inline representation that is des...

An Integrated Tool for Annotating Historical Corpora

E-Dictor is a tool for encoding, applying levels of editions, and assigning part-of-speech tags to ancient texts. In short, it works as a WYSIWYG interface to encode text in XML format. It comes from the experience during the building of the Tycho Brahe Parsed Corpus of Historical Portuguese and from consortium activities with other research groups. Preliminary results show a decrease of at leas...

Publication date: 2016